In [4]:
from IPython.display import Image 
In [5]:
Image(filename = 'Business Case.jpg')
Out[5]:

Business Case

For a logistics company that operates all types of vehicles: how can it reduce the cost of damage to vehicles and drivers, given weather, vehicle and driver data?

Problem Statement: Reduce the losses that logistics companies incur because of aggressive driving. Aggressive driving has many causes, such as weather conditions, lack of sleep or poor vehicle maintenance, and the list goes on. Breaks must be respected as important resting time for drivers; on the other hand, unscheduled or long breaks can disrupt the schedule and result in unhappy customers. Wear and tear can be caused by multiple factors, from aggressive driving to an inefficient maintenance system. Natural disasters and extreme weather conditions can further increase a company's losses.

Business understanding

Until now there was no method to identify aggressive drivers who were en route for a delivery. With machine learning techniques, a driver can now be identified as an aggressive, normal or vague driver. The data is available in three different files: the driver data, the vehicle data and the weather data. Combining them will give us better insight into the different factors behind a driver's aggressiveness.

Taking only the required libraries for the visualization

In [68]:
import matplotlib.pyplot as plt
import pandas as pd
from numpy import *
import numpy as np
import seaborn as sns
import plotly
import plotly.offline as pyoff
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
%matplotlib inline
from tqdm import tqdm_notebook
import random
import math
import pandas_profiling
import warnings
warnings.filterwarnings('ignore')
In [7]:
init_notebook_mode(connected=True)

Reading the provided dataset

Missing values are not always encoded as 'NA'; they can also appear as a white space or a question mark.

Passing na_values=[" ",".","NA","?","-",""] to read_csv lists the tokens that should be treated as NA.

In [8]:
train_data = pd.read_csv('Train.csv',na_values=[" ",".","NA","?","-",""])
train_vehicle = pd.read_csv('Train_Vehicletravellingdata.csv',na_values=[" ",".","NA","?","-",""])
train_weather = pd.read_csv('Train_WeatherData.csv',na_values=[" ",".","NA","?","-",""])
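The effect of the na_values list can be checked on a tiny in-memory file. This is a minimal sketch: the three-row CSV below is made up for illustration and only mimics the column names of Train.csv.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real files (assumption).
raw = "ID,V2,V5\nDR_1,550,?\nDR_2,.,2013\nDR_3,-,NA\n"

# Same token list as used for the real files above.
NA_TOKENS = [" ", ".", "NA", "?", "-", ""]
sample = pd.read_csv(io.StringIO(raw), na_values=NA_TOKENS)

# Each of '.', '-', '?' and 'NA' is now counted as a missing value.
null_counts = sample.isnull().sum()
print(null_counts)
```

Without na_values, the '?', '.' and '-' entries would be read as strings and the columns would silently become object columns instead of numeric ones.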

First we take the Train.csv file and make some basic observations: info, shape and the count of null values.

We will also rename the columns for easier interpretation.

Let's take one dataset at a time: Train data

Column mapping:
  ID           : "ID"
  V2           : "vehicle_length_cm"
  V5           : "vehicle_weight_kg"
  V6           : "number_of_axles"
  DrivingStyle : "DrivingStyle" (Target)

In [9]:
print(train_data.shape)
train_data.isnull().sum()
(12994, 5)
Out[9]:
ID              0
V2              0
V5              0
V6              0
DrivingStyle    0
dtype: int64
In [10]:
train_data.columns = ["ID","vehicle_length_cm","vehicle_weight_kg","number_of_axles","DrivingStyle"]
In [11]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12994 entries, 0 to 12993
Data columns (total 5 columns):
ID                   12994 non-null object
vehicle_length_cm    12994 non-null int64
vehicle_weight_kg    12994 non-null int64
number_of_axles      12994 non-null int64
DrivingStyle         12994 non-null int64
dtypes: int64(4), object(1)
memory usage: 507.7+ KB

From the above output it is evident that there are no null values.

Let's take the Second dataset: Vehicle data

We will perform the same operations as above

In [12]:
train_vehicle.columns = ["ID","trip_datetime","lane_no","vehicle_speed","pvehicle_id","pvehicle_speed_kph","pvehicle_weight_kg","pvehicle_length_cm","pvehicle_timegap","weather_road_cond"]

Column pvehicle_timegap has 2455 null values

In [13]:
# checking null values in train_vehicle
print(train_vehicle.shape)
train_vehicle.isnull().sum()
(162566, 10)
Out[13]:
ID                       0
trip_datetime            0
lane_no                  0
vehicle_speed            0
pvehicle_id              0
pvehicle_speed_kph       0
pvehicle_weight_kg       0
pvehicle_length_cm       0
pvehicle_timegap      2455
weather_road_cond        0
dtype: int64

For pvehicle_timegap we will do a mean imputation. The mean is a floating-point value, so we then cast the column to an integer datatype. Note that astype('int64') truncates the fractional part instead of rounding to the closest integer.

In [14]:
train_vehicle['pvehicle_timegap'] = train_vehicle['pvehicle_timegap'].fillna(train_vehicle['pvehicle_timegap'].mean())
In [15]:
train_vehicle['pvehicle_timegap'] = train_vehicle['pvehicle_timegap'].astype('int64')
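The difference between truncating and rounding the imputed mean can be seen on a toy column. This is a small sketch with made-up values, not the real time gaps.

```python
import pandas as pd

# Toy time-gap column with one missing value (values are made up).
gap = pd.Series([1.0, 2.0, None, 5.0])

# Mean of 1, 2 and 5 is 8/3, roughly 2.67.
filled = gap.fillna(gap.mean())

# astype('int64') drops the fractional part: 2.67 becomes 2.
truncated = filled.astype("int64")

# Rounding first gives the closest integer: 2.67 becomes 3.
rounded = filled.round().astype("int64")

print(truncated.tolist(), rounded.tolist())
```

If the closest integer is what is wanted, calling round() before the cast is one way to get it.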

We will also change the datatype of trip_datetime to datetime64 so that we can use that in future as and when required

In [16]:
# pd.to_datetime parses the strings and yields datetime64[ns] values
train_vehicle['trip_datetime'] = pd.to_datetime(train_vehicle['trip_datetime'])
In [17]:
train_vehicle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162566 entries, 0 to 162565
Data columns (total 10 columns):
ID                    162566 non-null object
trip_datetime         162566 non-null datetime64[ns]
lane_no               162566 non-null int64
vehicle_speed         162566 non-null int64
pvehicle_id           162566 non-null int64
pvehicle_speed_kph    162566 non-null int64
pvehicle_weight_kg    162566 non-null int64
pvehicle_length_cm    162566 non-null int64
pvehicle_timegap      162566 non-null int64
weather_road_cond     162566 non-null object
dtypes: datetime64[ns](1), int64(7), object(2)
memory usage: 12.4+ MB
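As a side note on the datetime conversion, here is a minimal sketch of pd.to_datetime on a few strings. The first two timestamps are taken from the summary statistics of this notebook; the third entry is deliberately invalid and made up for illustration.

```python
import pandas as pd

stamps = pd.Series(["2012-03-21 09:14:55", "2013-04-30 16:57:08", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(stamps, errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```

Counting the NaT values after conversion is a quick way to confirm that every timestamp in a column was parsed.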

Let's take the Third dataset: Weather data

We will perform the same operations here as well

In [18]:
train_weather.columns = ["ID",'trip_datetime','air_temp','prep_type','prep_intensity','realtive_humidity','wind_direction','wind_speed_ms','daylight_cond']
In [19]:
train_weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162566 entries, 0 to 162565
Data columns (total 9 columns):
ID                   162566 non-null object
trip_datetime        162566 non-null object
air_temp             160509 non-null float64
prep_type            162566 non-null object
prep_intensity       160292 non-null object
realtive_humidity    160461 non-null float64
wind_direction       160452 non-null float64
wind_speed_ms        160102 non-null float64
daylight_cond        162566 non-null object
dtypes: float64(4), object(5)
memory usage: 11.2+ MB
In [20]:
# checking null values in train_weather
print(train_weather.shape)
train_weather.isnull().sum()
(162566, 9)
Out[20]:
ID                      0
trip_datetime           0
air_temp             2057
prep_type               0
prep_intensity       2274
realtive_humidity    2105
wind_direction       2114
wind_speed_ms        2464
daylight_cond           0
dtype: int64

Mode imputation is used for prep_intensity. For air_temp, realtive_humidity and wind_speed_ms we impute the mean, brought to a whole number with the ceil or floor function. wind_direction is filled with the constant 180 rather than with a computed mean.

In [21]:
train_weather['air_temp'] = train_weather['air_temp'].fillna(math.ceil(train_weather['air_temp'].mean()))  #ceil
train_weather['realtive_humidity'] = train_weather['realtive_humidity'].fillna(math.ceil(train_weather['realtive_humidity'].mean()))
train_weather['wind_direction'] = train_weather['wind_direction'].fillna(180) # constant fill value
train_weather['wind_speed_ms'] = train_weather['wind_speed_ms'].fillna(math.floor(train_weather['wind_speed_ms'].mean()))
In [22]:
train_weather['prep_intensity'] = train_weather['prep_intensity'].astype('category')
In [23]:
# train_weather['prep_intensity'].mode()
In [24]:
train_weather['prep_intensity'] = train_weather['prep_intensity'].fillna(train_weather['prep_intensity'].mode()[0])
In [25]:
train_weather.isnull().sum()
Out[25]:
ID                   0
trip_datetime        0
air_temp             0
prep_type            0
prep_intensity       0
realtive_humidity    0
wind_direction       0
wind_speed_ms        0
daylight_cond        0
dtype: int64
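Mode imputation on a categorical column can be sketched as follows. The intensity labels here are made up for illustration and are not claimed to match the real prep_intensity categories.

```python
import pandas as pd

# Toy categorical column with two missing entries (labels are assumptions).
intensity = pd.Series(["Low", "Low", None, "High", "Low", None], dtype="category")

# mode() returns a Series because ties are possible; [0] takes the first mode.
most_common = intensity.mode()[0]

# The fill value must already be one of the categories, which the mode always is.
filled = intensity.fillna(most_common)

print(most_common, filled.isnull().sum())
```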
In [26]:
# pd.to_datetime parses the strings and yields datetime64[ns] values
train_weather['trip_datetime'] = pd.to_datetime(train_weather['trip_datetime'])

The basic preprocessing finishes here. Now we will join all the datasets using an inner join.

The inner join between the train data and the vehicle data is done on the 'ID' column.

The inner join between the joined data and the weather data is done on the 'ID' and 'trip_datetime' columns.

In [27]:
data = pd.merge(train_data,train_vehicle,on='ID',how='inner')
print(data.shape)
data = pd.merge(data,train_weather,on=['ID','trip_datetime'],how='inner')
print(data.shape)
(162566, 14)
(162566, 21)
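The two joins can be reproduced on toy frames. This is a sketch under assumptions: the rows are made up, and the validate argument is an optional safety check that is not part of the original cell. It makes pandas raise a MergeError if the keys are not as unique as expected.

```python
import pandas as pd

stamps = pd.to_datetime(
    ["2012-03-21 09:14:55", "2012-03-21 09:20:10", "2012-05-02 01:20:46"]
)

# One row per driver, several trip rows per driver (toy values).
drivers = pd.DataFrame({"ID": ["DR_1", "DR_2"], "DrivingStyle": [1, 3]})
trips = pd.DataFrame(
    {"ID": ["DR_1", "DR_1", "DR_2"], "trip_datetime": stamps, "vehicle_speed": [80, 85, 90]}
)
weather = pd.DataFrame(
    {"ID": ["DR_1", "DR_1", "DR_2"], "trip_datetime": stamps, "air_temp": [2.0, 3.0, 5.0]}
)

# Driver attributes are repeated onto every trip row of that driver.
merged = pd.merge(drivers, trips, on="ID", how="inner", validate="one_to_many")

# Weather rows are matched one-to-one on the composite key.
merged = pd.merge(
    merged, weather, on=["ID", "trip_datetime"], how="inner", validate="one_to_one"
)

print(merged.shape)
```

In the real data the row count stays at 162566 after both joins, which is consistent with every trip row finding exactly one driver row and one weather row.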
In [28]:
# for efficient memory usage
del train_data
del train_vehicle
del train_weather

converting all the object data type columns to categories

In [29]:
for column in data.columns:
    if data[column].dtypes == 'object':
        data[column] = data[column].astype('category')

checking the summary statistics for all the columns

In [30]:
data.describe(include='all').T
Out[30]:
                    count   unique  top                  freq    first                last                 mean     std       min  25%     50%     75%     max
ID                  162566  12994   DR_12145             112     NaN                  NaN                  NaN      NaN       NaN  NaN     NaN     NaN     NaN
vehicle_length_cm   162566  NaN     NaN                  NaN     NaN                  NaN                  865.518  495.156   155  550     577     1060    2337
vehicle_weight_kg   162566  NaN     NaN                  NaN     NaN                  NaN                  6020.27  7972.93   44   1625    2013    6220    57230
number_of_axles     162566  NaN     NaN                  NaN     NaN                  NaN                  2.84631  1.46656   2    2       2       3       9
DrivingStyle        162566  NaN     NaN                  NaN     NaN                  NaN                  2.14726  0.668102  1    2       2       3       3
trip_datetime       162566  162537  2012-05-02 01:20:46  2       2012-03-21 09:14:55  2013-04-30 16:57:08  NaN      NaN       NaN  NaN     NaN     NaN     NaN
lane_no             162566  NaN     NaN                  NaN     NaN                  NaN                  1.50085  0.500001  1    1       2       2       2
vehicle_speed       162566  NaN     NaN                  NaN     NaN                  NaN                  83.4555  9.37512   8    78      83      88      161
pvehicle_id         162566  NaN     NaN                  NaN     NaN                  NaN                  460124   272271    20   142983  594322  692363  794435
pvehicle_speed_kph  162566  NaN     NaN                  NaN     NaN                  NaN                  83.4588  9.37312   0    78      83      88      161
pvehicle_weight_kg  162566  NaN     NaN                  NaN     NaN                  NaN                  5017.56  7399.32   3    1502    1862    2669    69548
pvehicle_length_cm  162566  NaN     NaN                  NaN     NaN                  NaN                  790.775  481.944   102  527     560     701     2981
pvehicle_timegap    162566  NaN     NaN                  NaN     NaN                  NaN                  105.335  175.859   1    7       45      123     1797
weather_road_cond   162566  4       Dry                  117666  NaN                  NaN                  NaN      NaN       NaN  NaN     NaN     NaN     NaN
air_temp            162566  NaN     NaN                  NaN     NaN                  NaN                  4.65861  3.20645   -13  2       5       7       24
prep_type           162566  3       clear                151259  NaN                  NaN                  NaN      NaN       NaN  NaN     NaN     NaN     NaN
prep_intensity      162566  4       None                 153849  NaN                  NaN                  NaN      NaN       NaN  NaN     NaN     NaN     NaN
realtive_humidity   162566  NaN     NaN                  NaN     NaN                  NaN                  60.6521  18.1543   16   46      58      76      97
wind_direction      162566  NaN     NaN                  NaN     NaN                  NaN                  182.429  88.3482   6    152     180     208     360
wind_speed_ms       162566  NaN     NaN                  NaN     NaN                  NaN                  4.18579  3.00465   0    1       4       7       17
daylight_cond       162566  3       night                95925   NaN                  NaN                  NaN      NaN       NaN  NaN     NaN     NaN     NaN

Checking the correlation of all the columns with each other

In [31]:
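f, ax = plt.subplots(figsize=(13,9))
# correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(data.select_dtypes(include='number').corr(),annot=True,ax=ax)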
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x18824ef0>

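So from the above graph it is evident that most features are only weakly correlated with each other, which means they carry largely non-redundant information. Low correlation by itself says nothing about how much variance a feature has. The relationship between vehicle_length_cm and pvehicle_length_cm is worth checking further.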
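Correlation and variance are separate properties of the data. A small sketch with synthetic data (not the notebook's data) shows two independent features that both have very little variance and are still almost uncorrelated.

```python
import numpy as np
import pandas as pd

# Synthetic, independent features with a tiny spread (standard deviation 0.01).
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    {
        "a": rng.normal(0.0, 0.01, 10_000),
        "b": rng.normal(0.0, 0.01, 10_000),
    }
)

# Pearson correlation between the two columns: close to zero.
corr_ab = frame.corr().loc["a", "b"]

# Sample variance of one column: close to 0.01 squared.
var_a = frame["a"].var()

print(round(corr_ab, 3), var_a)
```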

We'll break the trip_datetime column down further into the following: Day, Month, Year, Hour, Minute, Second

In [32]:
data['trip_date'] = data['trip_datetime'].dt.date
data['trip_date_day'] = data['trip_datetime'].dt.day
data['trip_month'] = data['trip_datetime'].dt.month
data['trip_year'] = data['trip_datetime'].dt.year
data['trip_time'] = data['trip_datetime'].dt.time
data['trip_hour'] = data['trip_datetime'].dt.hour
data['trip_minute'] = data['trip_datetime'].dt.minute
data['trip_sec'] = data['trip_datetime'].dt.second

Calculating the speed difference between the current vehicle and the preceding vehicle

In [33]:
data['speed_difference'] = data['vehicle_speed'] - data['pvehicle_speed_kph']

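Since we have the speed of the preceding vehicle and the time gap between it and the current vehicle, we will make use of the formula below to calculate the distance between the two vehicles. Dividing a speed in km/h by 3.6 converts it to m/s, so the result is in metres if the time gap is in seconds (the unit of the time gap is not stated in the data).

                                           speed = distance / time, hence distance = speed * time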
In [34]:
data['distance'] = (data['pvehicle_speed_kph']/3.6)*data['pvehicle_timegap']
In [35]:
data['distance'] = data['distance'].astype('int64')
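The same calculation for a single pair of values, as a worked example. The function name following_distance_m is hypothetical, and the time gap is assumed to be in seconds.

```python
import math

def following_distance_m(speed_kph, gap_s):
    """Distance in metres covered at speed_kph during gap_s seconds."""
    speed_ms = speed_kph / 3.6  # km/h to m/s: 1000 m / 3600 s
    return speed_ms * gap_s

# 90 km/h is 25 m/s, so a 2-second gap corresponds to about 50 m.
d = following_distance_m(90, 2)
print(d)
```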
In [36]:
# calculating the bin edges: 18613 is a hard-coded bin width,
# and the three labels used below need exactly four edges
bins = list(range(0,data['distance'].max()+10,18613))
In [37]:
labels = ['Low','Medium','High']
data['binned_distance'] = pd.cut(data['distance'], bins=bins, labels=labels)
data['binned_distance'].value_counts()
Out[37]:
Low       160288
Medium      2078
High         199
Name: binned_distance, dtype: int64
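The three counts above add up to 162565, one less than the number of rows. A likely explanation is that pd.cut builds intervals that are open on the left by default, so a distance of exactly 0 (the minimum pvehicle_speed_kph is 0) falls outside the first bin. A minimal sketch with toy distances, using the first four multiples of 18613 as assumed edges:

```python
import pandas as pd

# Toy distances, including an exact zero (values are made up).
distance = pd.Series([0, 10, 20000, 40000])
edges = [0, 18613, 37226, 55839]
labels = ["Low", "Medium", "High"]

# Default: intervals are (0, 18613], (18613, 37226], ... so 0 gets no bin.
default_bins = pd.cut(distance, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left.
inclusive_bins = pd.cut(distance, bins=edges, labels=labels, include_lowest=True)

print(default_bins.isna().sum(), inclusive_bins.isna().sum())
```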
In [38]:
data.columns
Out[38]:
Index(['ID', 'vehicle_length_cm', 'vehicle_weight_kg', 'number_of_axles',
       'DrivingStyle', 'trip_datetime', 'lane_no', 'vehicle_speed',
       'pvehicle_id', 'pvehicle_speed_kph', 'pvehicle_weight_kg',
       'pvehicle_length_cm', 'pvehicle_timegap', 'weather_road_cond',
       'air_temp', 'prep_type', 'prep_intensity', 'realtive_humidity',
       'wind_direction', 'wind_speed_ms', 'daylight_cond', 'trip_date',
       'trip_date_day', 'trip_month', 'trip_year', 'trip_time', 'trip_hour',
       'trip_minute', 'trip_sec', 'speed_difference', 'distance',
       'binned_distance'],
      dtype='object')

Designing a custom color palette for plotly graphs

In [39]:
N = 30
# N evenly spaced hues around the colour wheel, as HSL strings understood by plotly
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, N)]
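When a colour is later picked at random from this palette, the index has to stay within 0..N-1. A minimal sketch, in which the names palette, rng and picks are introduced only for illustration:

```python
import numpy as np

N = 30

# Same idea as the palette above, with the hue formatted to one decimal.
palette = [f"hsl({h:.1f},50%,50%)" for h in np.linspace(0, 360, N)]

# Generator.integers(N) draws from 0..N-1, so every draw is a valid list index.
rng = np.random.default_rng(42)
picks = [int(rng.integers(N)) for _ in range(1000)]

print(palette[0], palette[-1])
print(min(picks), max(picks))
```

A draw from 1..N inclusive would occasionally return N, which is one past the end of a list with N entries.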
In [40]:
Image(filename = 'QnA_header.jpg')
Out[40]:

Each graph answered several of my questions

Univariate Analysis

Visualizing the categorical columns

Q1. What are the Distributions of the following attributes

  1. Distribution of number of axles
  2. Distribution of Driving Style
  3. Weather Road Condition
In [41]:
def barplot(temp,column_name):
    # temp is the value_counts() Series of a column: categories on x, counts on y
    plot_data = [go.Bar(
                x= temp.index,
                y= temp.values,
                text = temp.values,
                textposition = 'auto',
                # randint(N) returns 0..N-1, always a valid index into the palette c
                marker = dict(color=c[np.random.randint(N)])
        )]
    layout = go.Layout(
        autosize=True,
        title = "Distribution of"+" "+column_name,
    )

    fig = go.Figure(data=plot_data, layout=layout)
    iplot(fig)
  1. Distribution of number of axles: As we can see from the distribution, 2-axle vehicles are far more common than the rest. One possible reading, which the data alone cannot confirm, is that the company is at least 10 years old and operates in several countries.
  2. Distribution of Driving Style: 1 denotes the aggressive drivers, 2 the normal drivers and 3 the vague drivers. The distribution shows that aggressive drivers are the minority class.
  3. Weather Road Condition: The graph contains Dry, Wet, Visible track and Snow covered. Most records are on dry roads, which suggests that most drivers tend to avoid bad road conditions.
In [42]:
bar_graph_columns = ['number_of_axles','DrivingStyle','weather_road_cond']
for column in bar_graph_columns:
    barplot(data[column].value_counts(),column)
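Alongside the raw counts in the bar charts, class shares make the imbalance easier to read. A minimal sketch on a made-up DrivingStyle column:

```python
import pandas as pd

# Toy target column: 1 = aggressive, 2 = normal, 3 = vague (values are made up).
styles = pd.Series([1, 2, 2, 2, 3, 3, 2, 3, 2, 2], name="DrivingStyle")

# normalize=True returns fractions of the total instead of counts.
shares = styles.value_counts(normalize=True).sort_index()

print(shares)
```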

Q2. What are the Min, Max and Median of the following attributes

  1. Length of the vehicle
  2. Weight of the vehicle
  3. Preceding vehicle weight
  4. Preceding vehicle length
In [43]:
def boxplot(temp,column_name):
    plot_data = [go.Box(
    # plot the column that was passed in, not a hard-coded one
    y = temp,
    name = column_name,
    # randint(N) returns 0..N-1, always a valid index into the palette c
    marker = dict(color=c[np.random.randint(N)]))]
    layout = go.Layout(title = "Boxplot of"+" "+column_name)
    fig = go.Figure(data= plot_data, layout=layout)
    iplot(fig)
  1. Length of the vehicle: Since 2-axle vehicles are dominant, the length distribution is concentrated at the lower end: the minimum length of a vehicle is 155cm and the median is 577cm, while the maximum reaches 2337cm. So the company has a wide range of vehicles.
  2. Weight of the vehicle: The weight of the vehicle tells the same story.
  3. Preceding vehicle weight, Preceding vehicle length: These two graphs suggest that the drivers are mostly driving on highways.
In [44]:
boxplot_columns = ['vehicle_length_cm','vehicle_weight_kg', 'pvehicle_weight_kg', 'pvehicle_length_cm']
for column in boxplot_columns:
    temp = data[column]
    boxplot(temp,column)
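The numbers that a boxplot shows can also be read directly with agg. In this small sketch the five values are the min, quartiles and max of vehicle_length_cm from the summary statistics earlier in the notebook, used here as a stand-in for the full column.

```python
import pandas as pd

# Five-number summary of vehicle_length_cm, reused as toy data.
lengths = pd.Series([155, 550, 577, 1060, 2337], name="vehicle_length_cm")

# A list of function names returns one value per function.
summary = lengths.agg(["min", "median", "max"])

print(summary)
```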

Bivariate Analysis

In [45]:
# Number of axles vs Driving style

plot_data = []
temp = data.groupby(['DrivingStyle','number_of_axles']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['DrivingStyle','number_of_axles','Count']
# temp
temp['number_of_axles'].value_counts()
for i in np.sort(temp['number_of_axles'].unique()):
    trace = go.Bar(x = temp.DrivingStyle[temp.number_of_axles==i],
                   y = temp.Count[temp.number_of_axles==i],
                    text = temp.Count[temp.number_of_axles==i],
                    textposition = 'auto',
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Driving Style vs Number of axles',
                  yaxis = dict(title='Count'),
                  xaxis = dict(title='Driving Style'),
                  barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)

On the x-axis we have the Driving Style and on the y-axis the count. The graph gives more insight into the aggressive drivers: it suggests that aggressive driving is largely independent of the vehicle type.
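Stacked counts are hard to compare when the driving styles differ in size. One way to check whether the axle mix really is similar across styles is a row-normalized cross-tabulation. A sketch on made-up rows:

```python
import pandas as pd

# Toy data: driving style and axle count per record (values are made up).
toy = pd.DataFrame(
    {
        "DrivingStyle": [1, 1, 2, 2, 2, 2, 3, 3],
        "number_of_axles": [2, 3, 2, 2, 2, 3, 2, 2],
    }
)

# normalize="index" turns each row (driving style) into shares that sum to 1.
mix = pd.crosstab(toy["DrivingStyle"], toy["number_of_axles"], normalize="index")

print(mix)
```

If the rows of such a table look alike on the real data, that supports the independence reading; if they differ clearly, it does not.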

Q4. How different types of drivers are driving in different road conditions

In [46]:
plot_data = []
temp = data.groupby(['DrivingStyle','weather_road_cond']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['DrivingStyle','weather_road_cond','Count']
# temp
temp['weather_road_cond'].value_counts()
for i in np.sort(temp['weather_road_cond'].unique()):
    trace = go.Bar(x = temp.DrivingStyle[temp.weather_road_cond==i],
                   y = temp.Count[temp.weather_road_cond==i],
                    text = temp.Count[temp.weather_road_cond==i],
                    textposition = 'auto',
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Driving Style vs Road Condition',
                  yaxis = dict(title='Count'),
                  xaxis = dict(title='Driving Style'),
                  barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)

In the above distribution we see that aggressive drivers also drive rashly in wet road conditions.

Q5. How driving Style is getting affected by temperature and relative humidity

In [47]:
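# scatter of humidity vs temperature coloured by driving style
# (height replaces the size argument, which newer seaborn versions no longer accept)
sns.lmplot(x='realtive_humidity', y='air_temp', hue='DrivingStyle',aspect=2,
           data=data[['realtive_humidity','air_temp','DrivingStyle']],fit_reg=False,height=9)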
Out[47]:
<seaborn.axisgrid.FacetGrid at 0x194c7eb8>

There are 2 possible stories here:

  1. Either air_temp and relative humidity aren't challenging factors for aggressive drivers
  2. Or the driving style of a particular driver is considered aggressive in those air_temp and relative humidity conditions

Q6. Is there any relationship between the vehicle speed and the preceding vehicle speed?

In [48]:
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
# newer seaborn versions expect x/y as keywords and fill instead of shade
sns.kdeplot(x=data['vehicle_speed'], y=data['pvehicle_speed_kph'], cmap=cmap, fill=True);
# sns.lmplot(x='vehicle_speed', y='pvehicle_speed_kph',aspect=2,
#            data=data[['vehicle_speed','pvehicle_speed_kph']],fit_reg=False,size=9)

The above plot tells us that the speeds of the vehicle and of the preceding vehicle mostly lie in a range of 70-100. The drivers are presumably on a highway with a speed limit, which in this case appears to be 100.

Q7. What is the speed of the vehicle and its preceding vehicle in a particular lane?

In [49]:
# height replaces the size argument, which newer seaborn versions no longer accept
sns.lmplot(x='vehicle_speed', y='pvehicle_speed_kph',hue='lane_no',aspect=2,
           data=data[['vehicle_speed','pvehicle_speed_kph','lane_no']],fit_reg=False,height=9)
Out[49]:
<seaborn.axisgrid.FacetGrid at 0x18685978>

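The drivers in lane 1 are driving faster than those in lane 2. Note that the summary statistics above (mean lane_no of 1.50085) show that the records are split almost evenly between the two lanes, so an apparent excess of lane 1 points in the scatter plot is more likely a drawing-order effect than a real difference in counts.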

Q8. Is the Driving style of the driver dependent on the time

In [50]:
plot_data = []
temp = data.groupby(['trip_hour','DrivingStyle']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['trip_hour','DrivingStyle','Count']
# temp
temp['DrivingStyle'].value_counts()
for i in np.sort(temp['DrivingStyle'].unique()):
    trace = go.Bar(x = temp.trip_hour[temp.DrivingStyle==i],
                   y = temp.Count[temp.DrivingStyle==i],
                    text = temp.Count[temp.DrivingStyle==i],
                    textposition = 'auto',
                   # randint(N) returns 0..N-1, always a valid index into the palette c
                   marker = dict(color=c[np.random.randint(N)]),
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Hour vs Driving Style',
                  yaxis = dict(title='Count'),
                  xaxis = dict(title='Trip Hour'),
                  barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)

It is evident from the above graph:

  1. In the morning the drivers drive normally, but as the day passes the aggressiveness of the drivers increases
  2. This may reflect the traffic on the highway at that particular hour.

Q9. Is Daylight Condition affecting the drivers driving Style?

In [51]:
plot_data = []
temp = data.groupby(['daylight_cond','DrivingStyle']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['daylight_cond','DrivingStyle','Count']
# temp
temp['DrivingStyle'].value_counts()
for i in np.sort(temp['DrivingStyle'].unique()):
    trace = go.Bar(x = temp.daylight_cond[temp.DrivingStyle==i],
                   y = temp.Count[temp.DrivingStyle==i],
                    text = temp.Count[temp.DrivingStyle==i],
                    textposition = 'auto',
                   # randint(N) returns 0..N-1, always a valid index into the palette c
                   marker = dict(color=c[np.random.randint(N)]),
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Day Light Condition vs Driving Style',
                  yaxis = dict(title='Count'),
                  xaxis = dict(title='Daylight Condition'),
                  barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)

This suggests that drivers tend to drive more aggressively in particular daylight conditions.
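The group-count-then-stack pattern is repeated in several cells of this notebook. A possible refactoring is sketched below. The helper name stacked_bar_traces is hypothetical, and it returns plain dicts (which plotly's go.Figure accepts in place of go.Bar objects) so that the sketch runs without plotly installed.

```python
import pandas as pd

def stacked_bar_traces(frame, x_col, group_col):
    """Build one bar trace per level of group_col, with counts per x_col value."""
    counts = frame.groupby([group_col, x_col]).size().reset_index(name="Count")
    traces = []
    for level in sorted(counts[group_col].unique()):
        subset = counts[counts[group_col] == level]
        traces.append(
            {
                "type": "bar",
                "x": subset[x_col].tolist(),
                "y": subset["Count"].tolist(),
                "name": str(level),
            }
        )
    return traces

# Toy records (made up): hour of the trip and driving style.
toy = pd.DataFrame(
    {"trip_hour": [8, 8, 9, 9, 9], "DrivingStyle": [1, 2, 2, 2, 3]}
)
traces = stacked_bar_traces(toy, x_col="trip_hour", group_col="DrivingStyle")
print(traces)
```

In the notebook the result could then be passed as go.Figure(data=traces, layout=go.Layout(barmode='stack')).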

Q10. What are the daylight conditions at given hour in the day

In [67]:
plot_data = []
temp = data.groupby(['daylight_cond','trip_hour']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['daylight_cond','trip_hour','Count']
# temp
temp['daylight_cond'].value_counts()
for i in np.sort(temp['daylight_cond'].unique()):
    trace = go.Bar(x = temp.trip_hour[temp.daylight_cond==i],
                   y = temp.Count[temp.daylight_cond==i],
                    text = temp.Count[temp.daylight_cond==i],
                    textposition = 'auto',
                   # randint(N) returns 0..N-1, always a valid index into the palette c
                   marker = dict(color=c[np.random.randint(N)]),
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Day Light Condition vs Trip Hour',
                  yaxis = dict(title='Count'),
                  xaxis = dict(title='Trip Hour'),
                  barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)

With the above data and the knowledge from this website (https://www.accuweather.com/en/weather-news/five-world-capitals-shortest-daylight/41734413) we can say that the country probably lies at a high latitude in the northern hemisphere.

Q11. Is vehicle weight a constraint for Aggressive drivers?

In [64]:
plot_data = []
# for i in np.sort(data.DrivingStyle.unique()):
# one box per driving style; randint(N) returns 0..N-1, a valid index into the palette c
trace1 = go.Box(y = data.vehicle_weight_kg[data.DrivingStyle==1],
                        marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgba(219, 64, 82, 0.6)'),
                        name = "Aggressive")
trace2 = go.Box(y = data.vehicle_weight_kg[data.DrivingStyle==2],
                        marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgba(219, 64, 82, 0.6)'),
                        name = "Normal")
trace3 = go.Box(y = data.vehicle_weight_kg[data.DrivingStyle==3],
                        marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgba(219, 64, 82, 0.6)'),
                        name = "Vague")
plot_data = [trace1,trace2,trace3]
layout = go.Layout(
autosize=True, # auto size the graph? use False if you are specifying the height and width
# width=1000, # width of the figure in pixels
# height=600, # height of the figure in pixels
title = "Boxplot of {} column based on {} ".format('Vehicle_weight','Driving Style'), # title of the figure
# more granular control on the title font 
    titlefont=dict( 
        family='Courier New, monospace', # font family
        size=14, # size of the font
        color='black' # color of the font
    ),
#     granular control on the axes objects 
    xaxis=dict( 
        title='Driving Style',
    tickfont=dict(
        family='Courier New, monospace', # font family
        size=14, # size of ticks displayed on the x axis
        color='black'  # color of the font
    )
),
yaxis=dict(
    title='Weight of Vehicle',
    titlefont=dict(
        size=14,
        color='black'
    ),
    tickfont=dict(
        family='Courier New, monospace', # font family
        size=14, # size of ticks displayed on the y axis
        color='black' # color of the font
    )
),
)
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)  
# weight doesn't matter
# outliers misclassified as aggressive (cf. the CEO salary example)

The above graph shows that a few drivers with heavier vehicles also drive aggressively

Q12. What is the speed difference between the vehicle and the preceding vehicle across the different segments of drivers?

In [55]:
plot_data = []
# for i in np.sort(data.DrivingStyle.unique()):

# one box per driving style; randint(N) returns 0..N-1, a valid index into the palette c
trace1 = go.Box(y = data.speed_difference[data.DrivingStyle==1],
                        marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgb(9,56,125)'),
                        name = "Aggressive")
trace2 = go.Box(y = data.speed_difference[data.DrivingStyle==2],
                        marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgb(9,56,125)'),
                        name = "Normal")
trace3 = go.Box(y = data.speed_difference[data.DrivingStyle==3],
                        marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgb(9,56,125)'),
                        name = "Vague")
plot_data = [trace1,trace2,trace3]
layout = go.Layout(
autosize=True, # auto size the graph? use False if you are specifying the height and width
# width=1000, # width of the figure in pixels
# height=600, # height of the figure in pixels
title = "Boxplot of {} column based on {} ".format('Speed Difference','Driving Style'), # title of the figure
# more granular control on the title font 
    titlefont=dict( 
        family='Courier New, monospace', # font family
        size=14, # size of the font
        color='black' # color of the font
    ),
#     granular control on the axes objects 
    xaxis=dict( 
        title='Driving Style',
    tickfont=dict(
        family='Courier New, monospace', # font family
        size=14, # size of ticks displayed on the x axis
        color='black'  # color of the font
    )
),
yaxis=dict(
    title='Speed difference',
    titlefont=dict(
        size=14,
        color='black'
    ),
    tickfont=dict(
        family='Courier New, monospace', # font family
        size=14, # size of ticks displayed on the y axis
        color='black' # color of the font
    )
),
)
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)   

The speed difference between the vehicle and the preceding vehicle appears roughly symmetric around zero and approximately normally distributed, as the speeds of the vehicle and of its preceding vehicle are themselves approximately normally distributed
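A quick numerical companion to the boxplot is to compare mean and median and to look at the skewness. This is only a sketch on made-up, perfectly symmetric values. Symmetry is necessary for normality but does not prove it.

```python
import pandas as pd

# Toy speed differences, symmetric around zero (values are made up).
speed_difference = pd.Series([-3, -1, 0, 1, 3])

mean_diff = speed_difference.mean()
median_diff = speed_difference.median()

# Sample skewness: 0 for a symmetric sample, positive for a long right tail.
skewness = speed_difference.skew()

print(mean_diff, median_diff, skewness)
```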

Q13. How does the distance gap between the vehicle and the preceding vehicle vary across driving styles?

In [63]:
plot_data = []
temp = data.groupby(['DrivingStyle','binned_distance']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['DrivingStyle','binned_distance','Count']
# temp
temp['binned_distance'].value_counts()
for i in np.sort(temp['binned_distance'].unique()):
    trace = go.Bar(x = temp.DrivingStyle[temp.binned_distance==i],
                   y = temp.Count[temp.binned_distance==i],
                    text = temp.Count[temp.binned_distance==i],
                    textposition = 'auto',
                   marker = dict(color=c[np.random.randint(N)]),  # randint(N) gives a valid index 0..N-1
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Driving Style vs binned_distance',
                  yaxis = dict(title='Count'),
                  xaxis = dict(title='Driving Style'),
                  barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)

The distance between the vehicle and its preceding vehicle is mostly small (almost all records fall in the Low bin). Here the driving rules on highways could be made stricter.

Q14. The distance gap distribution across vehicles with different numbers of axles

In [61]:
data.number_of_axles
plot_data = []
temp = data.groupby(['number_of_axles','binned_distance']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['number_of_axles','binned_distance','Count']
# temp
temp['binned_distance'].value_counts()
for i in np.sort(temp['binned_distance'].unique()):
    trace = go.Bar(x = temp.number_of_axles[temp.binned_distance==i],
                   y = temp.Count[temp.binned_distance==i],
                    text = temp.Count[temp.binned_distance==i],
                    textposition = 'auto',
                   # randint(N) returns 0..N-1, always a valid index into the palette c
                   marker = dict(color=c[np.random.randint(N)]),
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Number of axles vs binned_distance',
                  yaxis = dict(title='Count'),
                  xaxis = dict(title='Number of axles'),
                  barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)

Clearly most of the data is for 2-axle vehicles and we have less data for other vehicles. In future we can collect more data for vehicles with different axle counts to do a more in-depth study

In [60]:
data.number_of_axles
plot_data = []
temp = data.groupby(['trip_month','DrivingStyle']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['trip_month','DrivingStyle','Count']
# temp
temp['trip_month'].value_counts()
for i in np.sort(temp['DrivingStyle'].unique()):
    trace = go.Bar(x = temp.trip_month[temp.DrivingStyle==i],
                   y = temp.Count[temp.DrivingStyle==i],
                    text = temp.Count[temp.DrivingStyle==i],
                    textposition = 'auto',
                   marker = dict(color=c[np.random.randint(N)]),  # randint(N) gives a valid index 0..N-1
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Trip Month vs Driving Style',
                  yaxis = dict(title='Count'),
                  xaxis = dict(title='Trip Month'),
                  barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)

In the graph the data points for months 6, 7, 8, 9 and 10 are missing, presumably because the data collection focused on aggressiveness in bad weather conditions.
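Which months are absent can be confirmed without reading it off the chart. A minimal sketch on a made-up trip_month column:

```python
import pandas as pd

# Toy month column with no summer or early autumn records (values are made up).
trip_month = pd.Series([1, 2, 3, 3, 4, 5, 11, 12, 12])

# Months that occur at least once.
present = set(trip_month.unique().tolist())

# Calendar months with no record at all.
missing = sorted(set(range(1, 13)) - present)

print(missing)
```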